AIML Capstone Project 2022: Group 2, Computer Vision 1

Pneumonia Detection Challenge

What is Pneumonia?

Pneumonia is an infection in one or both lungs. Bacteria, viruses, and fungi cause it. The infection causes inflammation in the air sacs in your lungs, which are called alveoli.

The alveoli fill with fluid or pus, making it difficult to breathe. Pneumonia can range from mild to so severe that you have to go to the hospital.

Pneumonia accounts for over 15% of all deaths of children under 5 years old internationally. In 2017, 920,000 children under the age of 5 died from the disease. Diagnosing pneumonia requires review of a chest radiograph (CXR) by highly trained specialists and confirmation through clinical history, vital signs and laboratory exams. Pneumonia usually manifests as an area or areas of increased opacity on CXR. However, the diagnosis of pneumonia on CXR is complicated by a number of other conditions in the lungs such as fluid overload (pulmonary edema), bleeding, volume loss (atelectasis or collapse), lung cancer, or post-radiation or surgical changes. Outside of the lungs, fluid in the pleural space (pleural effusion) also appears as increased opacity on CXR. When available, comparison of CXRs of the patient taken at different time points and correlation with clinical symptoms and history are helpful in making the diagnosis.

CXRs are the most commonly performed diagnostic imaging study. A number of factors such as positioning of the patient and depth of inspiration can alter the appearance of the CXR, complicating interpretation further. In addition, clinicians are faced with reading high volumes of images every shift.

Pneumonia Detection

To detect pneumonia, we need to detect inflammation of the lungs. In this project, you’re challenged to build an algorithm to detect a visual signal for pneumonia in medical images. Specifically, your algorithm needs to automatically locate lung opacities on chest radiographs.

How Is Pneumonia Diagnosed?

Sometimes pneumonia can be difficult to diagnose because the symptoms are so variable, and are often very similar to those seen in a cold or influenza. To diagnose pneumonia, and to try to identify the germ that is causing the illness, your doctor will ask questions about your medical history, do a physical exam, and run some tests.

Medical history

Your doctor will ask you questions about your signs and symptoms, and how and when they began. To help figure out if your infection is caused by bacteria, viruses or fungi, you may be asked some questions about possible exposures.

Physical exam

Your doctor will listen to your lungs with a stethoscope. If you have pneumonia, your lungs may make crackling, bubbling, and rumbling sounds when you inhale.

Diagnostic Tests

If your doctor suspects you may have pneumonia, they will probably recommend some tests to confirm the diagnosis and learn more about your infection. These may include:

1) Blood tests to confirm the infection and to try to identify the germ that is causing your illness.

2) Chest X-ray to look for the location and extent of inflammation in your lungs.

3) Pulse oximetry to measure the oxygen level in your blood. Pneumonia can prevent your lungs from moving enough oxygen into your bloodstream.

4) Sputum test on a sample of mucus (sputum) taken after a deep cough, to look for the source of the infection.

If you are considered a high-risk patient because of your age and overall health, or if you are hospitalized, the doctors may want to do some additional tests, including:

5) CT scan of the chest to get a better view of the lungs and look for abscesses or other complications.

6) Arterial blood gas test, to measure the amount of oxygen in a blood sample taken from an artery, usually in your wrist. This is more accurate than the simpler pulse oximetry.

7) Pleural fluid culture, which removes a small amount of fluid from around tissues that surround the lung, to analyze and identify bacteria causing the pneumonia.

8) Bronchoscopy, a procedure used to look into the lungs' airways. If you are hospitalized and your treatment is not working well, doctors may want to see whether something else is affecting your airways, such as a blockage. They may also take fluid samples or a biopsy of lung tissue.

Business Domain Value

Automate pneumonia screening in chest radiographs and report the affected area through bounding boxes.

Assist physicians in making better clinical decisions, or even replace human judgement in certain functional areas of healthcare (e.g., radiology).

Guided by relevant clinical questions, powerful AI techniques can unlock clinically relevant information hidden in the massive amount of data, which in turn can assist clinical decision making.

Image DataType

Medical images are stored in a special format called DICOM files (*.dcm). They contain a combination of header metadata as well as underlying raw image arrays for pixel data.
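As a small illustration of the "header plus pixel data" layout, the sketch below checks whether a file carries the DICOM Part 10 signature: a 128-byte preamble followed by the four bytes `DICM`. It uses only the standard library; the function names `looks_like_dicom` and `list_dicom_files` are hypothetical helpers, not part of the original notebook. To actually read the metadata and pixel array, a library such as pydicom (`pydicom.dcmread`) would normally be used instead.

```python
from pathlib import Path

DICOM_PREAMBLE_LEN = 128   # Part 10 files start with a 128-byte preamble
DICOM_MAGIC = b"DICM"      # ...followed by these four bytes


def looks_like_dicom(path):
    """Return True if the file carries the DICOM Part 10 signature."""
    with open(path, "rb") as fh:
        header = fh.read(DICOM_PREAMBLE_LEN + len(DICOM_MAGIC))
    return header[DICOM_PREAMBLE_LEN:] == DICOM_MAGIC


def list_dicom_files(folder):
    """Collect *.dcm files directly under `folder` that pass the signature check."""
    return sorted(p for p in Path(folder).glob("*.dcm") if looks_like_dicom(p))
```

A check like this can be run over the image folders before training, so that truncated or mislabeled files are caught early.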

Dataset link: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data

Prediction Output

In this project, we have to predict whether pneumonia exists in a given image. This is done by predicting bounding boxes around areas of the lung. Samples without bounding boxes are negative and contain no definitive evidence of pneumonia. Samples with bounding boxes indicate evidence of pneumonia.

When making predictions, the model should predict as many bounding boxes as necessary, in the format: confidence x-min y-min width height

There will be only ONE predicted row per image. This row may include multiple bounding boxes.

A properly formatted row may look like any of the following.

For patientIds with no predicted pneumonia / bounding boxes: 0004cfab-14fd-4e49-80ba-63a80b6bddd6,

For patientIds with a single predicted bounding box: 0004cfab-14fd-4e49-80ba-63a80b6bddd6,0.5 0 0 100 100

For patientIds with multiple predicted bounding boxes: 0004cfab-14fd-4e49-80ba-63a80b6bddd6,0.5 0 0 100 100 0.5 0 0 100 100, etc.

The general format is as follows:

patientId,{confidence x-min y-min width height},{confidence x-min y-min width height}, etc.
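The row format can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: the name `format_prediction_row` is hypothetical, and it follows the example rows above, where the values of successive boxes are separated by single spaces.

```python
def format_prediction_row(patient_id, boxes):
    """Build one submission row.

    `boxes` is a list of (confidence, x_min, y_min, width, height) tuples.
    An empty list yields the "no pneumonia" row: the id followed by a comma.
    """
    parts = []
    for confidence, x_min, y_min, width, height in boxes:
        # ":g" drops trailing zeros (100.0 -> "100", 0.50 -> "0.5").
        parts.append(f"{confidence:g} {x_min:g} {y_min:g} {width:g} {height:g}")
    return f"{patient_id}," + " ".join(parts)


pid = "0004cfab-14fd-4e49-80ba-63a80b6bddd6"
print(format_prediction_row(pid, []))
print(format_prediction_row(pid, [(0.5, 0, 0, 100, 100)]))
print(format_prediction_row(pid, [(0.5, 0, 0, 100, 100), (0.5, 0, 0, 100, 100)]))
```

The three printed rows reproduce the no-box, single-box, and multiple-box examples listed above.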

Dataset File Description

stage_2_train_labels.csv - the training set. Contains patientIds and bounding box / target information.

stage_2_detailed_class_info.csv - provides detailed information about the type of positive or negative class for each image.

Data Fields

Lung Opacity

Tissues with sparse material, such as lungs which are full of air, do not absorb the X-rays and appear black in the image. Dense tissues such as bones absorb X-rays and appear white in the image.

While we are theoretically detecting “lung opacities”, there are lung opacities that are not pneumonia related.

In the data, some of these are labeled “No Lung Opacity / Not Normal”. This extra third class indicates that while pneumonia was determined not to be present, there was nonetheless some type of abnormality on the image, and this finding may often mimic the appearance of true pneumonia.

It's important to note that the various shades of gray in the chest X-Ray refer to the following:

In a normal image we see the lungs as black, but they have different projections on them - mainly the rib cage bones, main airways, blood vessels and the heart.

In case of pneumonia, a haziness (also referred to as consolidation) is present in the chest X-ray image.

Images with no pneumonia-related opacity can still show rounded, hazy boundaries or masses (probably because of lung nodules or masses, which can be caused by cancer).

There are other exceptional cases as well in which the image is abnormal but pneumonia is absent. These include pneumonectomy (lung removed by surgery), enlarged heart, pleural effusion, etc.

Reference: https://www.kaggle.com/zahaviguy/what-are-lung-opacities

Pre-Processing, Data Visualization, EDA

Compare the labels and class information for possible join

Data Inference: A join or merge should give us a dataset with 30,227 rows, assuming we keep all rows and drop the redundant 'patientId' column.

Check uniqueness of the data

Approach:

There could be duplicate patientId entries, resulting in multiple bounding boxes with their associated target/class information. Check whether the sequence of records is the same in the "train labels" and "class information" datasets. If it is, a simple join can be performed on the index.
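The approach above can be sketched with plain Python rows. This is an illustration only: the helper names `is_synchronous` and `join_on_index` are hypothetical, and the toy rows are made up rather than taken from the dataset. The same check would usually be written with a dataframe library in the notebook itself.

```python
def is_synchronous(labels, class_info, key="patientId"):
    """True when both tables list the same key in the same row order."""
    return len(labels) == len(class_info) and all(
        a[key] == b[key] for a, b in zip(labels, class_info)
    )


def join_on_index(labels, class_info, key="patientId"):
    """Row-by-row join; the redundant key from class_info is dropped."""
    if not is_synchronous(labels, class_info, key):
        raise ValueError("row order differs; join on the key instead")
    return [
        {**a, **{k: v for k, v in b.items() if k != key}}
        for a, b in zip(labels, class_info)
    ]


# Toy rows (made up for illustration, not taken from the dataset).
labels = [
    {"patientId": "p1", "Target": 0},
    {"patientId": "p2", "Target": 1},
    {"patientId": "p2", "Target": 1},
]
class_info = [
    {"patientId": "p1", "class": "Normal"},
    {"patientId": "p2", "class": "Lung Opacity"},
    {"patientId": "p2", "class": "Lung Opacity"},
]
merged = join_on_index(labels, class_info)
```

Joining on position rather than on the key avoids multiplying rows for patients that appear more than once in both tables.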

Exploring Train Labels Dataset

Inference:

  1. 23,286 patients have a single entry and 3,266 patients have two entries. Furthermore, 119 patients have three entries and 13 patients have four.
  2. There are 26,684 unique patientIds, so the training sample contains 26,684 patients. Therefore, if the data is consistent, the "class information" dataset should contain the same 26,684 unique patients, and there should be the same number of training images.
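The counts above can be cross-checked with a little arithmetic, reading them as the number of patients that have one, two, three and four rows respectively. The sketch below also shows one way such a breakdown could be derived from a raw patientId column; `rows_per_patient_histogram` is a hypothetical helper name.

```python
from collections import Counter

# Number of patients having 1, 2, 3 or 4 rows, as reported above.
patients_by_row_count = {1: 23_286, 2: 3_266, 3: 119, 4: 13}

unique_patients = sum(patients_by_row_count.values())
total_rows = sum(
    n_rows * n_patients for n_rows, n_patients in patients_by_row_count.items()
)


def rows_per_patient_histogram(patient_ids):
    """Map 'rows per patient' -> 'number of such patients' for an id column."""
    rows_per_patient = Counter(patient_ids)
    return dict(Counter(rows_per_patient.values()))


print(unique_patients, total_rows)  # 26684 30227
```

The two totals agree with the 26,684 unique patients and the 30,227 rows mentioned elsewhere in this report, so the breakdown is internally consistent.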

Exploring Class Information Dataset

Merging Data

Shape of Dataset

1. Class Distribution

Observation:

The graph above shows the total number of records for each class. Records labeled No Lung Opacity / Not Normal are the most numerous, ahead of Lung Opacity and Normal.

8,851 (29.3%) records show no disease (Normal).

9,555 (31.6%) records have Lung Opacity.

11,821 (39.1%) records are No Lung Opacity / Not Normal.

2. Target to Class

Observation:

From the above graph, we can infer that 31.6% of the records in the dataset are positive for pneumonia and 68.4% are not.
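The percentages quoted in the class and target observations follow directly from the record counts. A minimal check, assuming that the positive target corresponds exactly to the "Lung Opacity" rows (which the matching 31.6% figure suggests):

```python
class_counts = {
    "Normal": 8_851,
    "Lung Opacity": 9_555,
    "No Lung Opacity / Not Normal": 11_821,
}
total = sum(class_counts.values())

# Share of each class among all records, in percent, rounded to one decimal.
class_share = {name: round(100 * n / total, 1) for name, n in class_counts.items()}

# Assumption: Target == 1 exactly for the "Lung Opacity" rows.
positive_share = class_share["Lung Opacity"]
negative_share = round(100 * (total - class_counts["Lung Opacity"]) / total, 1)

print(total, class_share, positive_share, negative_share)
```

Note that these shares are computed over records (rows), not over unique patients, because a patient with several bounding boxes contributes several rows.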

3. Impact of patient's age on pneumonia

Observation:

4. Relation between patient's age and different classes

Normal Data Distribution

Observation:

No Lung Opacity / Not Normal Data Distribution

Observation:

Lung Opacity Data Distribution

Observation:

5. Data Distribution on Patient Gender

Observation:

6. Different classes as per Patient Gender

Observation:

7. Impact on Age and Gender

Observation:

8. Age and Pneumonia relation

Observation:

9. View position on different features

Class

Observation:

Target

Observation:

Gender Wise

Observation:

Age wise

Observation:

10. Image Loading - Plotting different X-Rays

Model Building

Data Processing

Extract Data from DICOM file

Splitting Data into relative classes

1. VGG19

Save the Model

Plotting Accuracy and Validation Accuracy

Plotting Loss and Validation Loss

Model testing

Testing with saved Model

ROC Curve

Prediction

Prediction on Test Image

2. VGG16

Compile Model

Train the Model

Saving the Model

Plotting accuracy and validation accuracy

Plotting Loss and Validation Loss

Model Testing

Testing with saved Model

ROC Curve

3. ResNet50

Model Compilation

Model Training

Saving the Model

Plotting Accuracy and Validation Accuracy

Plotting Loss and Validation Loss

Model Testing

Testing with saved Model

Classification Report and ROC Curve

4. InceptionNet v3

Model Compilation

Model Training

Saving the Model

Plotting Accuracy and Validation Accuracy

Plotting Loss and Validation Loss

Model testing

Testing with saved Model

Classification report & ROC Curve

5. YOLO v3

Locate the position of inflammation in an image

1. Clone and Build YOLOv3

2. Data Migration for YOLOv3

2.0. Make subdirectories

2.1. Load stage_2_train_labels.csv

2.2. Generate images and labels for training YOLOv3
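For the label files, darknet expects one line per box in the form `<class> <x_center> <y_center> <width> <height>`, with the four box values normalized by the image size. The sketch below converts a box given as top-left corner plus size (the layout used in the train labels) into that form. It is an illustration under stated assumptions: the helper name `to_yolo_label` is hypothetical, a single class with id 0 is assumed, and the images are assumed to be square 1024 x 1024 radiographs that are not resized.

```python
IMAGE_SIZE = 1024  # assumption: square 1024 x 1024 radiographs


def to_yolo_label(x_min, y_min, width, height, class_id=0, image_size=IMAGE_SIZE):
    """Convert a pixel box (top-left corner + size) to a darknet label line.

    darknet expects: <class> <x_center> <y_center> <width> <height>,
    with all four box values divided by the image size.
    """
    x_center = (x_min + width / 2) / image_size
    y_center = (y_min + height / 2) / image_size
    return (
        f"{class_id} {x_center:.6f} {y_center:.6f} "
        f"{width / image_size:.6f} {height / image_size:.6f}"
    )


print(to_yolo_label(256, 128, 512, 256))
# 0 0.500000 0.250000 0.500000 0.250000
```

Images without bounding boxes would get an empty label file under this scheme.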

2.3. Plot a sample train image and label

2.4. Generate train and validation file path lists (.txt)
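darknet reads the training and validation sets from plain text files that hold one image path per line. A minimal sketch of producing such lists is given below; the helper names, the 10% validation fraction and the fixed seed are assumptions for illustration, not the values used in the notebook.

```python
import random


def split_paths(image_paths, val_fraction=0.1, seed=0):
    """Shuffle deterministically and split into (train, validation) lists."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    n_val = int(len(paths) * val_fraction)
    return paths[n_val:], paths[:n_val]


def write_path_list(paths, out_file):
    """Write one image path per line, the layout darknet reads."""
    with open(out_file, "w") as fh:
        fh.write("\n".join(paths) + "\n")
```

Splitting by image file keeps all boxes of one patient on the same side of the split, since each patient has a single image.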

2.5. Create test image and labels for YOLOv3

2.6. Plot a sample test Image

3. Prepare Configuration Files for Using YOLOv3
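Two details of the configuration are easy to get wrong, so a short sketch may help. In a YOLOv3 cfg, the convolutional layer feeding each `[yolo]` layer needs `(classes + 5) * 3` filters, and the `.data` file tells darknet where the path lists, class names and weight backups live. The helper names and all file paths below are hypothetical, and a single class (lung opacity) is assumed.

```python
def yolo_filters(num_classes, anchors_per_scale=3):
    """Filters in the conv layer feeding each [yolo] layer of YOLOv3.

    Each anchor predicts 4 box offsets + 1 objectness score + class scores.
    """
    return (num_classes + 5) * anchors_per_scale


def darknet_data_file(num_classes, train_list, valid_list, names_file, backup_dir):
    """Text of a darknet .data file pointing at the training inputs."""
    return (
        f"classes = {num_classes}\n"
        f"train = {train_list}\n"
        f"valid = {valid_list}\n"
        f"names = {names_file}\n"
        f"backup = {backup_dir}\n"
    )


print(yolo_filters(1))  # 18
# All paths below are placeholders.
print(darknet_data_file(1, "train.txt", "val.txt", "classes.names", "backup/"))
```

With one class, the cfg would therefore use `classes=1` in each `[yolo]` block and `filters=18` in the convolutional layer just before it.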

darknet53.conv.74 (Download Pre-trained Model)

4. Training YOLOv3

4.0. Training with Pre-trained CNN Weights (darknet53.conv.74)

4.1. Plots of Training Loss

5. Trained YOLOv3 for test images

5.1. Load trained model (at 15300 iterations)

5.2. cfg file for testing

Detection using a Pre-Trained Model

Prediction 1

Prediction 2

Summary

Results:

Classification

  1. VGG19: Validation Accuracy - 82% and AUC-ROC - 82.6%.
  2. VGG16: Validation Accuracy - 82% and AUC-ROC - 87.8%.
  3. ResNet50: Validation Accuracy - 73% and AUC-ROC - 80.9%.
  4. InceptionNet v3: Validation Accuracy - 79% and AUC-ROC - 87%.
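For readers comparing the two metrics, AUC-ROC can be read as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a plain-Python illustration with a hypothetical function name and toy scores, not the routine used to produce the figures above.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    Compares all P * N pairs, which is fine for an illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy scores (made up): three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Unlike accuracy, this value does not depend on the decision threshold, which is why two models with the same validation accuracy can differ in AUC-ROC.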

Locating the position of inflammation in an image

YOLO v3

Limitations:

Way Forward / Future Work:

References:

[1] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview

[2] https://keras.io/api/applications/vgg/

[3] https://www.kaggle.com/keras/resnet50

[4] https://keras.io/api/applications/inceptionv3/

[5] YOLOv3: An Incremental Improvement, Joseph Redmon, Ali Farhadi, University of Washington